perm filename VIS[0,BGB]4 blob
sn#073900 filedate 1973-11-25 generic text, type T, neo UTF8
2.0 Computer Vision Theory.
2.1 Introduction.
2.2 Vision Tasks.
2.3 Mobile Robot Vision.
2.4 Vision Systems.
2.5 The Vision Cycle.
2.6 The Nature of Images.
2.7 The Nature of Worlds.
2.8 Locus Solving.
2.9 Related Work.
2.10 Computer Vision and Artificial Intelligence.
2.11 Visual Consciousness.
2.12 Summary of Arguments.
2.13 Future Vision Work.
2.1 Introduction to Computer Vision Theory.
Vision is the act or power of seeing. Computer vision
concerns programming a computer to do a task that demands the use of
an image forming light sensor, such as a television camera. The
theory I intend to elaborate is that normal vision is a continuous
process of keeping an internal visual simulator in sync with
perceived images of the external reality, for the sake of some goal.
In this chapter, several levels of theory are presented.
There is general theory, which is my interpretation of the state of
the art of computer vision. There is the special theory, which
led to the particular design choices I have made. There are
alternate theories and designs, which are mentioned for the sake of
contrast. Finally, there is my personal world view on the nature of
visual perception and consciousness. The word "theory", as used
here, means simply a set of statements presenting a systematic view
of a subject. Specifically, I wish to exclude the connotations that
the theory is a mathematical theory or a natural theory. Perhaps
there can be such a thing as an "artificial theory" which extends
from the philosophy through the design of an artifact.
Although such an artificial theory is ultimately validated
by the successful production of the intended artifact, unvalidated
designs are compared by the usual tools of academic debate:
analogies, anecdotes, scenarios and rhetoric. In early 1942, there
were five ideas on how to manufacture fissionable material for a
bomb: three uranium isotope separation techniques (electromagnetic,
centrifuge and gaseous-diffusion) and two plutonium reactor
techniques (graphite and heavy water). In spite of the considerable
power of theory in nuclear physics, there was no a priori way to
select the best method; so all of the ideas were tried, and three of
the methods were made to work by 1945. In computer vision, there
are three substantially different approaches: description,
verification and recognition; all of which may ultimately work.
2.2 Computer Vision Tasks.
The overall vision research problem I wish to
discuss is that of finding out how to write programs that can see in
the real world. Alternate vision research problems include:
modeling human perception, solving visual puzzles, and developing
advanced automation techniques. In order to approach the problem,
specific programming tasks are proposed and solutions sought. Please
distinguish the notion of a research problem from that of a
programming task. As will be illustrated, many vision tasks can be
effectively done without vision. The vision solution I seek must be
able to deal with real images, emphasize the continuity of the
visual process in time and space, and be general purpose rather than
ad hoc. These three requirements will be discussed again later, and
so for Mnemosyne, a slogan: Reality, Continuity, Generality. Now for
a quick survey of seven computer vision tasks [see table].
First, there is the robot chauffeur task. In 1969, John
McCarthy asked me to consider the vision requirements of a computer
controlled car such as he depicted in an essay [see appendix 1]. The
idea is that a user of such an automatic car would request a
destination; the robot would select a route from an internally
stored road map; and then would proceed to its destination using
only visual data. The chauffeur is a subordinate part of the
McCarthy advice taker scenario about getting to the airport. The
problem involves representing the road map in the computer and
establishing the correspondence between the map and the visual sight
of the road as the driver servos the vehicle along the desired route.
Lacking a computer controlled car, the problem was immediately
simplified to that of tracing a route along the driveways and
parking lots that surround the Stanford A.I. Laboratory.
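The route selection step of the chauffeur task can be illustrated
with a minimal sketch. It assumes the road map is stored as a weighted
graph and that the route is chosen by a shortest path search
(Dijkstra's algorithm); the text does not say how the map was
actually represented, and the place names and distances below are
invented for illustration.

```python
import heapq

# Hypothetical road map: nodes are junctions, edge weights are segment
# lengths in meters. Names and distances are invented, not from the text.
ROAD_MAP = {
    "lab":      {"driveway": 40.0, "lot_a": 90.0},
    "driveway": {"lab": 40.0, "lot_a": 30.0, "gate": 120.0},
    "lot_a":    {"lab": 90.0, "driveway": 30.0, "gate": 70.0},
    "gate":     {"driveway": 120.0, "lot_a": 70.0},
}


def select_route(road_map, start, goal):
    """Return (total_length, list_of_nodes) for a shortest route, or None."""
    frontier = [(0.0, start, [start])]   # priority queue ordered by distance
    settled = set()                      # nodes whose shortest distance is final
    while frontier:
        dist, node, path = heapq.heappop(frontier)
        if node == goal:
            return dist, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, length in road_map[node].items():
            if neighbor not in settled:
                heapq.heappush(
                    frontier, (dist + length, neighbor, path + [neighbor])
                )
    return None


length, route = select_route(ROAD_MAP, "lab", "gate")
print(route, length)
```

The selected route is only the plan; establishing the correspondence
between each map segment and what the camera sees remains the vision
problem proper.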
Second, there is the robot explorer. In 1967, McCarthy and
Lederberg published a description of a robot for exploring the
surface of the planet Mars. The robot explorer depicted was
designed to run for long periods of time without human intervention
because the signal transmission time to Mars is as great as forty
minutes and because the roughly 24.6 hour Martian day would place the
vehicle out of sight for 12 hours at a time. The latter difficulty
could be overcome by having a set of communication relay
satellites in orbit around Mars. The task of the explorer would be
to drive around mapping the surface of Mars, looking for interesting
features, and doing various experiments.
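The forty minute figure can be checked with a little arithmetic. The
sketch below assumes approximate Earth to Mars distances of 55 million
km at closest approach and 400 million km at the farthest (round
numbers supplied here, not taken from the text) and computes the
round trip time of a radio signal.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458   # exact, by definition of the meter

# Assumed approximate Earth-Mars distances; round numbers for illustration.
NEAREST_KM = 55e6
FARTHEST_KM = 400e6


def round_trip_minutes(distance_km):
    """Time for a command to reach Mars and the reply to come back."""
    return 2.0 * distance_km / SPEED_OF_LIGHT_KM_S / 60.0


near = round_trip_minutes(NEAREST_KM)
far = round_trip_minutes(FARTHEST_KM)
print(f"round trip: {near:.1f} to {far:.1f} minutes")
```

Under these assumptions the round trip delay ranges from about six
minutes to somewhat over forty, which is why the explorer cannot be
steered by a human operator in real time.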
The third vision task is that of the robot soldier, tank or
sentry. The problem has several forms which are quite similar to the
chauffeur, the explorer and the machine assembler. Although this
vision task has not been explicitly attempted, the reader should
note that a thorough solution to any of the other tasks almost
assures the technology to solve task 3. See section 2.X for further
discussion of social implications.
Fourth, the turn table task is to construct a 3-D model from
a sequence of 2-D television images taken of an object rotated on a
turn table. The turn table task was intentionally selected as a
simplification of the explorer task.
Fifth, the classic blocks vision task, first attempted by
Roberts, consists of two parts: first convert a video image
into a line drawing; second, make a selection from a set of predefined
prototype models of blocks that accounts for the line drawing.
[single image vs. multiple images].
[perfect line drawing puzzles: Guzman & Waltz].
[imperfect line drawing analysis]
Sixth, recognition vision tasks include:
character recognition, face recognition, aircraft recognition,
Seventh, the Stanford Hand Eye Project has
recently dedicated itself to solving the task of automatic
machine assembly. In particular, the group will try to develop
techniques that will be demonstrated by the fully automatic
assembly of a chain saw gasoline engine.
Where is the part ? and where is the hole ?
Location Task: Where is it.
Identification Task: What is it.
Eighth, there are animal vision tasks.
TABLE OF 3-D COMPUTER VISION TASKS.
---------------------------------------------------------------------
1. The Robot Chauffeur. Cart Task.
Given a computer controlled cart and a road map,
drive the cart along a preselected route,
without crashing into anything.
2. The Robot Explorer. Cart Task.
Given a computer controlled cart,
explore and map the world,
without crashing into anything.
3. The Robot Soldier. Cart Task.
Given a computer controlled vehicle,
locate and destroy the enemy.
4. Turn Table Task.
The turn table task is to construct a 3-D model from a
sequence of 2-D television images taken of an object
rotated on a turn table.
5. The Blocks Task.
First, convert a video image into a line drawing;
Second, identify and locate the blocks in the line drawing.
6. Recognition Tasks.
Character recognition,
Face recognition,
Aircraft recognition,
7. Machine Assembly Tasks.
Where is the part ? Where is the hole ?
Location Task: Where is it.
Identification Task: What is it.
---------------------------------------------------------------------
2.3 Mobile Robot Vision.
---------------------------------------------------------------------
Chauffeur Cart Solutions:
1. Map, predict 2D image, verify features, and solve for camera locus.
2. Map, retrieve 2D image, verify features, and solve for camera locus.
---------------------------------------------------------------------
Explorer Cart Solutions:
1. Photoreconnaissance, correlation and make contour maps.
2. Photoreconnaissance, match and describe, body locus solving.
---------------------------------------------------------------------
I will now propose two solutions each for the two cart
tasks, chauffeur and explorer; that is four systems in all.
With abundant naivete and energy, I coded a simple
edge finder, which actually could find the left and right curbs of
the road in some television images; however (sad to relate) the edge
finder was easily fooled, and so to make it smarter I began to
put in a model of the given road. What followed then was nearly
four years of trying to do really good world modeling for the sake
of computer vision by verification.
The cart at the Stanford Artificial Intelligence Laboratory
is intended for outdoor use and consists of four bicycle wheels, a
piece of plywood, two car batteries, a television camera, a
television transmitter, and a toy airplane radio receiver. The
vehicle being discussed is not "Shakey", which belongs to the
Stanford Research Institute's Artificial Intelligence Group. There
are two A.I. labs near Stanford and each has a computer controlled
vehicle. Logically the cart has three motors, each of which can be
commanded to run in one or the other direction under computer control.
The six possible cart actions are: run forwards, run backwards, steer
to the left, steer to the right, pan camera to the left, pan camera
to the right.
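The logical command set of the cart is small enough to write down
completely. The following sketch shows one way to tabulate it,
assuming each action is a pair of a motor and a direction sense; the
type names Motor and Sense and the function command are hypothetical
and are not taken from the actual cart software.

```python
from enum import Enum
from itertools import product


class Motor(Enum):
    """The three logical motors of the cart."""
    DRIVE = "drive"
    STEER = "steer"
    PAN = "pan"


class Sense(Enum):
    """Each motor can be commanded to run in one of two directions."""
    PLUS = 1
    MINUS = -1


# The six possible cart actions, one per (motor, sense) pair.
CART_ACTIONS = {
    (Motor.DRIVE, Sense.PLUS): "run forwards",
    (Motor.DRIVE, Sense.MINUS): "run backwards",
    (Motor.STEER, Sense.PLUS): "steer to the left",
    (Motor.STEER, Sense.MINUS): "steer to the right",
    (Motor.PAN, Sense.PLUS): "pan camera to the left",
    (Motor.PAN, Sense.MINUS): "pan camera to the right",
}


def command(motor, sense):
    """Return the name of the action for one motor command."""
    return CART_ACTIONS[(motor, sense)]


# Every combination of motor and sense is covered exactly once.
assert set(CART_ACTIONS) == set(product(Motor, Sense))
print(command(Motor.PAN, Sense.MINUS))
```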
2.4 Vision Systems.
A computer vision system can be described as mediating
between external perceived images and an internal world model. The
two poles (or operands) of the system are called the "bottom" for
images and the "top" for the models. The "world model" operand can
be identified even in vision systems that do not advertise it. Work
that truly lacks a world model is not computer vision; usually it
is image processing. Given the two classes of operands, images and
worlds, there are three operations which a general vision system
may perform: recognition, verification and description.
Verification vision is also called top-down or model-driven
vision. The verification approach involves predicting an image,
followed by comparing the predicted image and a perceived image for
slight differences which are expected but not yet measured.
Recognition vision and descriptive vision are also called bottom-up
or data-driven vision. Recognition vision is qualitative: what is in
the picture is determined by extracting a set of features
(qualities) and by classifying them according to an essentially
statistical world model. Description vision is quantitative. Many
theories are only superficially different, in that they consist of
compounding the three basic modes of vision, or of using different
forms of the two basic elements: image and model.
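The verification step can be made concrete with a toy example. The
sketch below assumes a single scan line of intensities and a model
that predicts where an edge (say, a curb) should cross it; the
function looks only in a small window around the prediction and
reports the difference between the measured and predicted positions.
The function name, the window size and the sample data are all
illustrative assumptions, not the edge verifier described in this
work.

```python
def measure_edge_offset(scan_line, predicted_index, window=3):
    """Search near predicted_index for the largest intensity step.

    Returns (measured index - predicted index), or None if the window
    contains no intensity change at all.
    """
    lo = max(1, predicted_index - window)
    hi = min(len(scan_line) - 1, predicted_index + window)
    best_index, best_step = None, 0
    for i in range(lo, hi + 1):
        step = abs(scan_line[i] - scan_line[i - 1])
        if step > best_step:
            best_index, best_step = i, step
    if best_index is None:
        return None
    return best_index - predicted_index


# Invented scan line: dark road on the left, bright curb from index 5 on.
scan = [10, 10, 11, 10, 12, 60, 62, 61, 60, 61]
offset = measure_edge_offset(scan, predicted_index=4)
print(offset)
```

Because the model says where to look, the search is cheap and a
spurious edge elsewhere in the image cannot fool it; the price is that
the prediction must already be nearly right.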
The Vision Mandala.
1. PREDICT   3D → 2D   synthesis   verification
2. PERCEIVE  2D → 3D   analysis    revelation
3. COMPARE                         recognition
Three modes of operation on the vision cycle.
1. Revelation Vision - Data Driven Vision.
(nearly pure bottom up vision).
2. Verification Vision - Model Driven Vision.
(nearly pure top down vision).
3. Recognition Vision - feature classification.
(bottom up random access into existing top).
Vision.
Heuristic Vision - guess and test.
Accommodating Vision.
(first bottom-up, next top-down, then verify and correct).
---------------------------------------------------------------------
The vision system is:
1. Continuous rather than discrete.
2. Exact rather than fuzzy; numeric rather than symbolic.
3. Bidirectional rather than one way.
2.5 The Vision Cycle.
The vision mediation has three possible modes:
revelation, verification and recognition.
Depending on circumstances, a vision system should be able to
run almost entirely top-down (verification vision) or bottom-up
(revelation vision). Verification vision is all that is required in
a well known and consequently predictable environment; whereas
revelation vision is required in a brand new or rapidly changing
environment.
Recognition involves comparing
perceived data with predicted data; such recognition comparing can
be done on any of the four types of 2-D images or the 3-D models.
Arcane recognition techniques can be avoided by improving the
prediction and the analysis so that matches are nearly obvious.
2.6 The Nature of Images.
There are three basic kinds of information in a 2-D visual
image: photometric, geometric, and topological; also there are
three kinds of 2-D images: raster, contour, and mosaic.
The traditional subject of image processing involves the study and
development of programs that enhance, transform and compare 2D
images. Nearly all such image processing work can be subsumed into
computer vision.
---------------------------------------------------------------------
Assumption: The perceived images are low quality, black and white,
digitized television images.
Alternatives: 1. High quality electronic imaging device.
2. Film scanning system.
3. Active 3-D imaging device.
4. Non-light devices: sound, radar, neutrinos, etc.
Discussion:
The argument in favor of using low quality, black and white,
television images is based on poverty rather than principle. Low
quality television is the cheapest electronic way to perceive an
image in real time.
Although a super intellectual entity would have eyes that
could see the whole electromagnetic spectrum from gamma radiation to
direct current as well as "voices" that could broadcast on any and
all frequencies; the video restriction
---------------------------------------------------------------------
An image contains three basic kinds of data:
topological data, geometric data, and photometric data.
The quality of the particular computer vision system
that one is condemned to use is a great influence on one's
theoretical approach.
size of image
photometric accuracy, bits per pixel
resolution
speed of image taking
2.7 The Nature of Worlds.
The rules about the world that can be assumed a priori by a
programmer are the laws of physics; programming a Newtonian
simulator of the mundane physical world to a given approximation is
difficult but more fruitful than programming an Aristotelian
simulator.
(Reality Simulation).
---------------------------------------------------------------------
Assumption: The visual world model should be a 3-D geometric model.
Alternatives: 1. Image memory and 2-D models.
2. Procedural Knowledge, e.g. Hewitt & Winograd.
3. Semantic knowledge, e.g. Wilks.
4. Formal Logic models, e.g. McCarthy & Hayes.
5. Statistical world model, e.g. Duda & Hart.
Discussion:
---------------------------------------------------------------------
Assumption: Partial knowledge is represented by approximation.
Alternatives: 1. Tree of possibilities.
2. Multi valued logic.
3. Probabilities.
Discussion:
---------------------------------------------------------------------
2.8 Locus Solving.
1. Camera Locus Solving.
2. Body Locus Solving.
Silhouette Cone Intersection.
Envelope bodies.
3. Sun Locus Solving.
(compute it, look at it, shine and shadows).
The crux of computer vision is to deduce information about
the world being viewed from images of that world. The world
information most directly relevant to vision is the physical
location, extent and light scattering properties of solid opaque
objects; the location, orientation and scales of the cameras that
take the pictures; and the location and nature of the lights that
illuminate the world. Accordingly, three important vision problems
are camera solving, body solving, and sun solving.
The macroscopic world doesn't change very rapidly; between any two
world states there is an intermediate world state. Parallax is the
principal means of depth perception. Parallax is the alchemist that
converts 2-D images into 3-D models. Revelation vision is a process
of comparing perceived images taken in sequence and constructing a
3-D model of the unanticipated objects.
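How parallax yields depth can be shown with the simplest possible
case. The sketch assumes two pinhole camera positions with parallel
optical axes, separated by a known sideways baseline, and a feature
matched in both images; these are simplifying assumptions made for
illustration, and the general camera and body locus solving problems
are harder. Under them, depth is focal length times baseline divided
by the parallax shift.

```python
def depth_from_parallax(focal_length_px, baseline_m, disparity_px):
    """Depth of a feature seen from two laterally displaced pinhole cameras.

    focal_length_px : focal length expressed in pixel units
    baseline_m      : sideways distance between the two camera positions
    disparity_px    : shift of the feature between the two images
    """
    if disparity_px <= 0:
        raise ValueError("no measurable parallax; depth is undetermined")
    return focal_length_px * baseline_m / disparity_px


# Invented numbers: the cart moves 0.25 m sideways between two pictures
# and a feature shifts by 10 pixels in a camera with a 500 pixel focal length.
depth = depth_from_parallax(500.0, 0.25, 10.0)
print(depth)   # depth in meters
```

The formula also shows why distant objects are hard: as the shift
approaches zero, a one pixel measurement error produces an ever larger
error in depth.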
2.9 Related Work.
Larry Roberts is justly credited for doing the seminal work
in Computer Vision; and although his thesis appeared over ten years
ago, the subject has languished, dependent on and overshadowed by the
four areas called: Image Processing, Pattern Recognition, Computer
Graphics, and Artificial Intelligence. Outside of the computer
sciences, two subjects, psychology and neurology, also seek a
theory of vision. I will first briefly state the relevant aspects of
computer vision in each of these six subject areas, and second
acknowledge the particular authors that influenced my work.
---------------------------------------------------------------------
(Computer Vision and Image Processing).
Image processing involves the study and development of
programs that enhance, transform and compare 2D images. Nearly all
image processing work can eventually be applied to computer vision.
---------------------------------------------------------------------
(Computer Vision and Pattern Recognition).
Image pattern recognition involves two steps: feature extraction
and classification.
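The two steps can be sketched in a few lines. The example assumes two
crude features (mean brightness and the fraction of bright pixels) and
a nearest prototype rule for classification; the features, the
prototype table and the class names are invented for illustration and
stand in for the "essentially statistical world model" of recognition
vision.

```python
import math


def extract_features(image):
    """Step 1: reduce an image (rows of 0..255 intensities) to a feature vector."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    bright_fraction = sum(1 for p in pixels if p > 128) / len(pixels)
    return (mean / 255.0, bright_fraction)


# Hypothetical prototypes: one feature vector per known class.
PROTOTYPES = {
    "dark blob": (0.1, 0.0),
    "bright blob": (0.9, 1.0),
}


def classify(features, prototypes):
    """Step 2: pick the class whose prototype is nearest in feature space."""
    return min(prototypes, key=lambda name: math.dist(features, prototypes[name]))


image = [[200, 220], [210, 230]]
label = classify(extract_features(image), PROTOTYPES)
print(label)
```

Note that the answer is always one of the classes already in the
table; this is the sense in which recognition is a random access into
an existing top rather than a description of something new.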
---------------------------------------------------------------------
(Computer Vision and Computer Graphics).
Descriptive computer vision is the inverse of computer
graphics. The problem of computer graphics is to synthesize images
from three dimensional models; the problem of descriptive computer
vision is to analyze images into three dimensional models.
---------------------------------------------------------------------
(Computer Vision and Artificial Intelligence).
At one extreme, computer vision may be described as merely
the problem of getting visual input hardware properly connected to a
computer; once the computer can "see" a raster of intensities in its
memory, the rest of the problem is artificial intelligence. The
other extreme is harder to depict because it requires figuring where
to draw the line between vision software and intelligence software;
one goal I wish to pursue in this chapter is to demarcate such a line.
2.10 Computer Vision and Artificial Intelligence.
A favorite pastime of technology aficionados
consists of defining the term "Artificial Intelligence".
The founders Minsky and McCarthy coined the phrase;
critics such as Lighthill and Dreyfus
and advocates Nilsson and Feigenbaum.
Futurologists such as Herman Kahn, use the term in sentences such as
"True artificial intelligence will not appear until around 2020";
which would seem to leave us, twentieth century people,
with {artificial} artificial intelligence.
Normal vision, as opposed to visual puzzles, is not an
Artificial Intelligence problem in the sense that it does not
involve cognition, verbal abstraction, symbolism,
theorem proving, game playing, planning, heuristic programming or
self programming. In fact, I feel that computer vision, list
processing and symbolic integration will drop out of Artificial
Intelligence.
"The history of progress in the development of systems for automatic
symbolic integration poses an interesting question about the
definition of artificial intelligence. Few would argue that Slagle's
SAINT program was a product of artificial intelligence research.
Moses' SIN program for symbolic integration seldom needed to resort
to search, and for this reason some people consider it much more
powerful (intelligent ?) than SAINT. Now, Risch (1969) has developed
an algorithm for integrating many types of expressions. Risch
considers himself a mathematician, not an artificial intelligence
researcher. In your opinion should Risch's algorithm be considered
part of the subject matter of artificial intelligence ? If you would
exclude Risch from artificial intelligence, how would you respond to
the statement that every artificial intelligence program might
eventually be dominated by a (more intelligent?) non artificial
intelligence algorithm? If you would include Risch, would you also
include the long-division algorithm?"
- Nils J. Nilsson, problem 4-5;
Problem-Solving Methods in Artificial Intelligence.
In answer to Nilsson's problem, I would exclude Risch from
Artificial Intelligence and cheerfully look forward to the remote
day when all A.I. problems are superseded by specific programming
techniques.
(Feigenbaum Quote).
The relation between Artificial Intelligence, experiment,
and environmental simulation is indirectly illuminated by
Feigenbaum's observation:
"The design, implementation, and use of the robot hardware
presents some difficult, and often expensive, engineering and
maintenance problems. If one is to work in this area solving such
problems it is a necessary prelude but, more often than not,
unrewarding because the activity does not address the questions of
A.I. research that motivate the project. Why, then, build devices?
Why not simulate them and their environment? In fact, the SRI group
has done good work in simulating a version of their robot in a
simplified environment. The answer given is as follows. It is felt
by the SRI group that the most unsatisfactory part of their
simulation effort was the simulation of the environment. Yet, they
say that 90% of the effort of the simulation team went into this
part of the simulation. It turned out to be very difficult to
reproduce in an internal representation for a computer the necessary
richness of environment that would give rise to interesting behavior
by the highly adaptive robot. It is easier and cheaper to build a
hardware robot to extract what information it needs from the real
world than to organize and store a useful model. Crudely put, the
SRI group's argument is that the most economic and efficient store
of information about the real world is the real world itself."
- E. A. Feigenbaum [ref. X].
Feigenbaum's final statement is correct: the real world is a
good memory of itself; but his conclusion is in error, because it is
necessary to have an environmental simulator in order to read the
world. His opinion, that the building of the robot hardware is not
an integral part of the A.I. research, is very characteristic of
senior Artificial Intelligence theorists and leaves the junior A.I.
experimentalists with the curse of shoddy tools.
2.11 Visual Consciousness.
"For the purpose of presenting my argument I must first explain the
basic premise of sorcery as don Juan presented it to me. He said
that for a sorcerer, the world of everyday life is not real, or out
there, as we believe it is. For a sorcerer, reality or the world we
all know, is only a description. For the sake of validating this
premise don Juan concentrated the best of his efforts into leading
me to a genuine conviction that what I held in mind as the world at
hand was merely a description of the world; a description that had
been pounded into me from the moment I was born."
- Carlos Castaneda. Journey to Ixtlan.
The larger context of a vision theory depends on one's
opinion about human consciousness. In my opinion, mind is a program
that is running in the brain.
Now consider: what software is needed to account for
consciousness, the private life of the self that burns in our
heads ? The so called stream of consciousness consists of little
voice(s) talking, fragments of music playing, and a color visual
display of the present moment. I believe that the major computation
being performed by an intellectual entity in order to stay
conscious of its external world is a reality simulation.
The basic inspiration for this thesis is a subtle
analogy between 3-D computer graphics and human vision. First
consider computer graphics: it is possible to program a computer to
simulate the view of a camera moving through a simulated scene.
Architects look at simulated building designs, cartoonists look at
computer simulated commercials,
and pilots look at simulated aircraft
carriers. Second, the
position of the simulated camera can be controlled either by direct
command or indirectly by a further simulation, such as of an
airplane. In the 3-D display system at the University of Utah, the
position of the simulated camera is kept coincident with the
physical position of the eyes of the viewer.
Now consider human vision. You are where your eyes are.
The analogy is that the display simulator resembles
the visual display that goes on inside one's head. The subtlety lies in
identifying analogous elements.
{introspection & mimicry arguments: for and against}.
2.12 Summary of Arguments.
Vision Problems vs. Vision Tasks.
Discussion of visual tasks.
THREE REQUIREMENTS: (OF A VISION THEORY).
1. REALITY.
Preference for working with real images rather than
with puzzle images (i.e. perfect images).
2. GENERALITY.
Preference for the descriptive approach rather than the
classification model.
3. CONTINUITY.
Preference for vision in continuous time and space
rather than discrete vision.
THREE MODES: (OF A VISION SYSTEM).
1. REVELATION ≡ DESCRIPTION.
2. VERIFICATION.
3. RECOGNITION.
Argument against a "Vision Language";
2.13 Future Vision Work.
Significant progress in computer vision will have to await
better computer hardware and better computer graphics software,
specifically world modeling software.
At Stanford University, Lynn Quam and Hans Moravec,
The machine assembly tasks,
At Stanford Research Institute
because the demand for doing practical vision tasks can be satisfied
with existing ad hoc methods or by not using a visual sensor at all.
The potential of a computer entertainment industry...
As William Shakespeare and Carl Hewitt would agree:
all the world's a stage and all the men, women and robots actors.
2.X Social Consequences.
Although the political and social consequences of computer
vision are somewhat more remote than those of other computer
applications, the potential for abuse is so great that I feel that it
is necessary to try to develop corresponding ethics along with the
science and technology.
During the period of this research, 1969 to 1973 inclusive,...
An exceedingly good and exact theory of vision could generate....
The potential benefits of understanding vision
outweigh the potential harm that could be wrought.
As an engineering project, the construction of a killer robot
is safer than research in biological warfare
and definitely not in the same league as the construction of a doomsday
machine or even as the invention of nuclear weapons.